Abstract

The mutual information of two random variables $i$ and $j$ with joint probabilities
$\{\pi_{ij}\}$ is commonly used in learning Bayesian nets as well as in many other fields.
The chances $\pi_{ij}$ are usually estimated by the empirical sampling frequency $n_{ij}/n$,
leading to a point estimate $I(n_{ij}/n)$ for the mutual information. To answer questions
like “is $I(n_{ij}/n)$ consistent with zero?” or “what is the probability that the true
mutual information is much larger than the point estimate?” one has to go beyond
the point estimate. In the Bayesian framework one can answer these questions
by utilizing a (second order) prior distribution $p(\pi)$ comprising prior information
about $\pi$. From the prior $p(\pi)$ one can compute the posterior $p(\pi|n)$, from which the
distribution $p(I|n)$ of the mutual information can be calculated. We derive reliable
and quickly computable approximations for $p(I|n)$. We concentrate on the mean,
variance, skewness, and kurtosis, and non-informative priors. For the mean we also
give an exact expression. Numerical issues and the range of validity are discussed.







1 Introduction 


The mutual information $I$ (also called cross entropy) is a widely used information theoretic
measure for the stochastic dependency of random variables [CT91, Soo00]. It is used,
for instance, in learning Bayesian nets [Bun96, Hec98], where stochastically dependent
nodes shall be connected. The mutual information defined in (1) can be computed if the
joint probabilities $\{\pi_{ij}\}$ of the two random variables $i$ and $j$ are known. The standard
procedure in the common case of unknown chances $\pi_{ij}$ is to use the sample frequency
estimates $n_{ij}/n$ instead, as if they were precisely known probabilities; but this is not always
appropriate. Furthermore, the point estimate $I(n_{ij}/n)$ gives no clue about the reliability of
the value if the sample size $n$ is finite. For instance, for independent $i$ and $j$, $I(\pi) = 0$
but $I(n_{ij}/n) = O(n^{-1/2})$ due to noise in the data. The criterion for judging dependency is
how many standard deviations $I(n_{ij}/n)$ is away from zero. In [KJ96, Kle99] the probability
that the true $I(\pi)$ is greater than a given threshold has been used to construct Bayesian
nets. In the Bayesian framework one can answer these questions by utilizing a (second
order) prior distribution $p(\pi)$, which takes account of any impreciseness about $\pi$. From
the prior $p(\pi)$ one can compute the posterior $p(\pi|n)$, from which the distribution $p(I|n)$
of the mutual information can be obtained.

The objective of this work is to derive reliable and quickly computable analytical
expressions for $p(I|n)$. Section 2 introduces the mutual information distribution, Section 3
discusses some results in advance before delving into the derivation. Since the central
limit theorem ensures that $p(I|n)$ converges to a Gaussian distribution, a good starting
point is to compute the mean and variance of $p(I|n)$. In Section 4 we relate the mean
and variance to the covariance structure of $p(\pi|n)$. Most non-informative priors lead to
a Dirichlet posterior. An exact expression for the mean (Section 6) and approximate
expressions for the variance (Section 5) are given for the Dirichlet distribution. More
accurate estimates of the variance and higher central moments are derived in Section 7,
which lead to good approximations of $p(I|n)$ even for small sample sizes. We show that
the expressions obtained in [KJ96, Kle99] by heuristic numerical methods are incorrect.
Numerical issues and the range of validity are briefly discussed in Section 8.


2 Mutual Information Distribution 

We consider discrete random variables $i \in \{1,\dots,r\}$ and $j \in \{1,\dots,s\}$ and an i.i.d. random
process with samples $(i,j) \in \{1,\dots,r\} \times \{1,\dots,s\}$ drawn with joint probability $\pi_{ij}$. An
important measure of the stochastic dependence of $i$ and $j$ is the mutual information

$$I(\pi) = \sum_{i=1}^{r}\sum_{j=1}^{s} \pi_{ij}\log\frac{\pi_{ij}}{\pi_{i+}\pi_{+j}} = \sum_{ij}\pi_{ij}\log\pi_{ij} - \sum_{i}\pi_{i+}\log\pi_{i+} - \sum_{j}\pi_{+j}\log\pi_{+j}. \qquad (1)$$

Here log denotes the natural logarithm, and $\pi_{i+} = \sum_j \pi_{ij}$ and $\pi_{+j} = \sum_i \pi_{ij}$ are marginal
probabilities. Often one does not know the probabilities $\pi_{ij}$ exactly, but one has a sample set
with $n_{ij}$ outcomes of pair $(i,j)$. The frequency $\hat\pi_{ij} := n_{ij}/n$ may be used as a first estimate
of the unknown probabilities, where $n := \sum_{ij} n_{ij}$ is the total sample size. This leads to a point
(frequency) estimate $I(\hat\pi) = \sum_{ij}\frac{n_{ij}}{n}\log\frac{n_{ij}\,n}{n_{i+}n_{+j}}$ for the mutual information (per sample).

Unfortunately the point estimate $I(\hat\pi)$ gives no information about its accuracy. In
the Bayesian approach to this problem one assumes a prior (second order) probability
density $p(\pi)$ for the unknown probabilities $\pi_{ij}$ on the probability simplex. From this one
can compute the posterior distribution $p(\pi|n) \propto p(\pi)\prod_{ij}\pi_{ij}^{n_{ij}}$ (the $n_{ij}$ are multinomially
distributed). This allows one to compute the posterior probability density of the mutual
information¹

$$p(I|n) = \int \delta(I(\pi) - I)\,p(\pi|n)\,d^{rs}\pi. \qquad (2)$$

The $\delta(\cdot)$ distribution restricts the integral to those $\pi$ for which $I(\pi) = I$.² For large sample size
$n \to \infty$, $p(\pi|n)$ is strongly peaked around $\pi = \hat\pi$ and $p(I|n)$ gets strongly peaked around
the frequency estimate $I = I(\hat\pi)$. The mean $E[I] = \int_0^\infty I\,p(I|n)\,dI = \int I(\pi)\,p(\pi|n)\,d^{rs}\pi$ and
the variance $\mathrm{Var}[I] = E[(I - E[I])^2] = E[I^2] - E[I]^2$ are of central interest.
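
As a concrete illustration of the point (frequency) estimate obtained from (1), the following minimal sketch computes $I(\hat\pi)$ from an $r \times s$ table of counts. The function name `mutual_information` and the nested-list table layout are assumptions made for this example only; empty cells are skipped using the usual convention $0 \log 0 = 0$.

```python
import math


def mutual_information(counts):
    """Frequency estimate I(pi_hat) from an r x s table of counts n_ij.

    Hypothetical helper for illustration; `counts` is a list of rows.
    """
    n = sum(sum(row) for row in counts)            # total sample size n
    row_sums = [sum(row) for row in counts]        # marginals n_{i+}
    col_sums = [sum(col) for col in zip(*counts)]  # marginals n_{+j}
    total = 0.0
    for i, row in enumerate(counts):
        for j, n_ij in enumerate(row):
            if n_ij > 0:  # convention: 0 * log 0 = 0
                ratio = n_ij * n / (row_sums[i] * col_sums[j])
                total += (n_ij / n) * math.log(ratio)
    return total
```

For a table whose cells factorize exactly into the marginals the estimate is zero, and for a perfectly dependent $2 \times 2$ table it equals $\log 2$, the upper bound $\min\{\log r, \log s\}$.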


3 Results for I under the Dirichlet P(oste)rior 


Most³ non-informative priors for $p(\pi)$ lead to a Dirichlet posterior distribution
$p(\pi|n) \propto \prod_{ij}\pi_{ij}^{n_{ij}-1}$ with interpretation $n_{ij} = n'_{ij} + n''_{ij}$, where $n'_{ij}$ is the number
of samples $(i,j)$, and $n''_{ij}$ comprises prior information (1 for the uniform prior, $\frac12$ for
Jeffreys’ prior, 0 for Haldane’s prior, $\frac{1}{rs}$ for Perks’ prior [GCSR95]). In principle this
allows one to compute the posterior density $p(I|n)$ of the mutual information. In Sections 4
and 5 we expand the mean and variance in terms of $n^{-1}$:


$$E[I] = \sum_{ij}\frac{n_{ij}}{n}\log\frac{n_{ij}\,n}{n_{i+}n_{+j}} + \frac{(r-1)(s-1)}{2n} + O(n^{-2}),$$

$$\mathrm{Var}[I] = \frac{1}{n}\left[\sum_{ij}\frac{n_{ij}}{n}\left(\log\frac{n_{ij}\,n}{n_{i+}n_{+j}}\right)^2 - \left(\sum_{ij}\frac{n_{ij}}{n}\log\frac{n_{ij}\,n}{n_{i+}n_{+j}}\right)^2\right] + O(n^{-2}). \qquad (3)$$

The first term for the mean is just the point estimate $I(\hat\pi)$. The second term is a small
correction if $n \gg r \cdot s$. Kleiter [KJ96, Kle99] determined the correction by Monte Carlo
studies as $\min\{\frac{r-1}{2n}, \frac{s-1}{2n}\}$. This is wrong unless $s$ or $r$ are 2. The expression $2E[I]/n$ they
determined for the variance has a completely different structure than ours. Note that the
mean is lower bounded by $\mathrm{const.}/n + O(n^{-2})$, which is strictly positive for large, but finite
sample sizes, even if $i$ and $j$ are statistically independent and independence is perfectly
represented in the data ($I(\hat\pi) = 0$). On the other hand, in this case, the standard deviation
$\sigma = \sqrt{\mathrm{Var}[I]} \sim \frac{1}{n} \sim E[I]$ correctly indicates that the mean is still consistent with zero.

¹ $I(\pi)$ denotes the mutual information for the specific chances $\pi$, whereas $I$ in the context above is
just some non-negative real number. $I$ will also denote the mutual information random variable in the
expectation $E[I]$ and variance $\mathrm{Var}[I]$. Expectations are always w.r.t. the posterior distribution $p(\pi|n)$.

² Since $0 \le I(\pi) \le I_{\max}$ with sharp upper bound $I_{\max} := \min\{\log r, \log s\}$, the integral may be restricted
to $\int_0^{I_{\max}}$, which shows that the domain of $p(I|n)$ is $[0, I_{\max}]$.

³ But not all priors which one can argue to be non-informative lead to Dirichlet posteriors. Brand
[Bra99] (and others), for instance, advocate the entropic prior $p(\pi) \propto e^{-H(\pi)}$.

Our approximations (3) for the mean and variance are good if $\frac{r \cdot s}{n}$ is small. The central
limit theorem ensures that $p(I|n)$ converges to a Gaussian distribution with mean $E[I]$
and variance $\mathrm{Var}[I]$. Since $I$ is non-negative it is more appropriate to approximate $p(I|n)$
as a Gamma (= scaled $\chi^2$) or log-normal distribution with mean $E[I]$ and variance $\mathrm{Var}[I]$,
which is of course also asymptotically correct.
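
Matching a Gamma or log-normal distribution to a given mean and variance is a two-parameter moment fit. The sketch below shows one way this could be done; the function names `gamma_params` and `lognormal_params` are illustrative assumptions, and the inputs stand for whatever values of $E[I]$ and $\mathrm{Var}[I]$ have been computed.

```python
import math


def gamma_params(mean, var):
    """Shape k and scale theta of a Gamma law with the given mean and variance.

    Uses mean = k * theta and var = k * theta**2 (requires mean > 0, var > 0).
    """
    shape = mean * mean / var
    scale = var / mean
    return shape, scale


def lognormal_params(mean, var):
    """Parameters (mu, sigma) of log I ~ Normal(mu, sigma^2) with matching moments.

    Uses mean = exp(mu + sigma^2 / 2) and var = (exp(sigma^2) - 1) * mean^2.
    """
    sigma2 = math.log(1.0 + var / (mean * mean))
    mu = math.log(mean) - 0.5 * sigma2
    return mu, math.sqrt(sigma2)
```

Both families have support on the non-negative reals, which is the reason given above for preferring them to a Gaussian when $E[I]$ is within a few standard deviations of zero.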

A systematic expansion in $n^{-1}$ of the mean, variance, and higher moments is possible but
gets arbitrarily cumbersome. The $O(n^{-2})$ terms for the variance and leading order terms
for the skewness and kurtosis are given in Section 7. For the mean it is possible to give
an exact expression

$$E[I] = \frac{1}{n}\sum_{ij} n_{ij}\left[\psi(n_{ij}+1) - \psi(n_{i+}+1) - \psi(n_{+j}+1) + \psi(n+1)\right] \qquad (4)$$

with $\psi(n+1) = -\gamma + \sum_{k=1}^{n}\frac{1}{k} = \log n + O(\frac{1}{n})$ for integer $n$. See Section 6 for details and more
general expressions for $\psi$ for non-integer arguments.

There may be other prior information available which cannot be comprised in a Dirichlet
distribution. In this general case, the mean and variance of $I$ can still be related to the
covariance structure of $p(\pi|n)$, which will be done in the following section.


4 Approximation of Expectation and Variance of I 

In the following let $\hat\pi_{ij} := E[\pi_{ij}]$. Since $p(\pi|n)$ is strongly peaked around $\pi = \hat\pi$ for large
$n$, we may expand $I(\pi)$ around $\hat\pi$ in the integrals for the mean and the variance. With
$\Delta_{ij} := \pi_{ij} - \hat\pi_{ij}$ and using $\sum_{ij}\pi_{ij} = 1 = \sum_{ij}\hat\pi_{ij}$ we get for the expansion of (1)

$$I(\pi) = I(\hat\pi) + \sum_{ij}\log\left(\frac{\hat\pi_{ij}}{\hat\pi_{i+}\hat\pi_{+j}}\right)\Delta_{ij} + \sum_{ij}\frac{\Delta_{ij}^2}{2\hat\pi_{ij}} - \sum_{i}\frac{\Delta_{i+}^2}{2\hat\pi_{i+}} - \sum_{j}\frac{\Delta_{+j}^2}{2\hat\pi_{+j}} + O(\Delta^3). \qquad (5)$$

Taking the expectation, the linear term $E[\Delta_{ij}] = 0$ drops out. The quadratic terms
$E[\Delta_{ij}\Delta_{kl}] = \mathrm{Cov}(\pi_{ij},\pi_{kl})$ are the covariance of $\pi$ under the distribution $p(\pi|n)$ and are
proportional to $n^{-1}$. It can be shown that $E[\Delta^3] \sim n^{-2}$ (see Section 7).


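$$E[I] = I(\hat\pi) + \frac{1}{2}\sum_{ijkl}\left(\frac{\delta_{ik}\delta_{jl}}{\hat\pi_{ij}} - \frac{\delta_{ik}}{\hat\pi_{i+}} - \frac{\delta_{jl}}{\hat\pi_{+j}}\right)\mathrm{Cov}(\pi_{ij},\pi_{kl}) + O(n^{-2}). \qquad (6)$$

The Kronecker delta $\delta_{ij}$ is 1 for $i = j$ and 0 otherwise. The variance of $I$ in leading order
in $n^{-1}$ is

$$\mathrm{Var}[I] = E[(I - E[I])^2] \doteq E\left[\left(\sum_{ij}\log\left(\frac{\hat\pi_{ij}}{\hat\pi_{i+}\hat\pi_{+j}}\right)\Delta_{ij}\right)^2\right] = \sum_{ijkl}\log\frac{\hat\pi_{ij}}{\hat\pi_{i+}\hat\pi_{+j}}\,\log\frac{\hat\pi_{kl}}{\hat\pi_{k+}\hat\pi_{+l}}\,\mathrm{Cov}(\pi_{ij},\pi_{kl}), \qquad (7)$$

where $\doteq$ means = up to terms of order $n^{-2}$. So the leading order variance and the leading
and next to leading order mean of the mutual information $I(\pi)$ can be expressed in terms
of the covariance of $\pi$ under the posterior distribution $p(\pi|n)$.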
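
The covariance-based expressions for the mean and the leading-order variance can be evaluated for any posterior whose mean and covariance are available. The sketch below is one possible direct transcription, under the assumption that the mean correction is $\frac12\sum_{ijkl}(\delta_{ik}\delta_{jl}/\hat\pi_{ij} - \delta_{ik}/\hat\pi_{i+} - \delta_{jl}/\hat\pi_{+j})\,\mathrm{Cov}(\pi_{ij},\pi_{kl})$ and the variance is the quadratic form of the log-ratios with the covariance. The function name `moments_from_covariance` and the callable `cov(i, j, k, l)` are hypothetical interfaces chosen for illustration; all $\hat\pi_{ij}$ must be strictly positive.

```python
import math


def moments_from_covariance(pi_hat, cov):
    """Approximate E[I] and Var[I] from posterior means and covariances.

    pi_hat : r x s nested list of posterior means (all > 0, summing to 1).
    cov    : callable cov(i, j, k, l) returning Cov(pi_ij, pi_kl).
    """
    r, s = len(pi_hat), len(pi_hat[0])
    row = [sum(pi_hat[i]) for i in range(r)]
    col = [sum(pi_hat[i][j] for i in range(r)) for j in range(s)]
    # log(pi_ij / (pi_i+ * pi_+j)): coefficients of the linear term in Delta
    lr = [[math.log(pi_hat[i][j] / (row[i] * col[j])) for j in range(s)]
          for i in range(r)]
    point = sum(pi_hat[i][j] * lr[i][j] for i in range(r) for j in range(s))
    correction = 0.0
    variance = 0.0
    for i in range(r):
        for j in range(s):
            for k in range(r):
                for l in range(s):
                    c = cov(i, j, k, l)
                    # second-derivative coefficients of I at pi_hat
                    h = -(i == k) / row[i] - (j == l) / col[j]
                    if i == k and j == l:
                        h += 1.0 / pi_hat[i][j]
                    correction += 0.5 * h * c
                    variance += lr[i][j] * lr[k][l] * c
    return point + correction, variance
```

The four nested loops cost $O(r^2 s^2)$; for the Dirichlet posterior the sums collapse analytically, which is why the closed forms of the next section are preferable in practice.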


5 The Second Order Dirichlet Distribution 


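Non-informative priors for $p(\pi)$ are commonly used if no additional prior information is
available. Many non-informative choices (uniform, Jeffreys’, Haldane’s, Perks’, ... prior)
lead to a Dirichlet posterior distribution:

$$p(\pi|n) = \frac{1}{N(n)}\prod_{ij}\pi_{ij}^{n_{ij}-1}\,\delta(\pi_{++} - 1) \quad\text{with normalization}\quad N(n) = \int\prod_{ij}\pi_{ij}^{n_{ij}-1}\,\delta(\pi_{++} - 1)\,d^{rs}\pi = \frac{\prod_{ij}\Gamma(n_{ij})}{\Gamma(n)}, \qquad (8)$$

where $\Gamma$ is the Gamma function, and $n_{ij} = n'_{ij} + n''_{ij}$, where $n'_{ij}$ is the number of samples
and $n''_{ij}$ comprises prior information (1 for the uniform prior, $\frac12$ for Jeffreys’ prior, 0
for Haldane’s prior, $\frac{1}{rs}$ for Perks’ prior). Mean and covariance of $p(\pi|n)$ are

$$\hat\pi_{ij} := E[\pi_{ij}] = \frac{n_{ij}}{n}, \qquad \mathrm{Cov}(\pi_{ij},\pi_{kl}) = \frac{1}{n+1}\left(\hat\pi_{ij}\delta_{ik}\delta_{jl} - \hat\pi_{ij}\hat\pi_{kl}\right). \qquad (9)$$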


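Inserting this into (6) and (7) we get after some algebra for the mean and variance of the
mutual information $I(\pi)$ up to terms of order $n^{-2}$:

$$E[I] = J + \frac{(r-1)(s-1)}{2(n+1)} + O(n^{-2}), \qquad (10)$$

$$\mathrm{Var}[I] = \frac{1}{n+1}\left(K - J^2\right) + O(n^{-2}), \qquad (11)$$

$$J := \sum_{ij}\frac{n_{ij}}{n}\log\frac{n_{ij}\,n}{n_{i+}n_{+j}} = I(\hat\pi), \qquad (12)$$

$$K := \sum_{ij}\frac{n_{ij}}{n}\left(\log\frac{n_{ij}\,n}{n_{i+}n_{+j}}\right)^2. \qquad (13)$$

$J$ and $K$ (and $L$, $M$, $P$, $Q$ defined later) depend on $\hat\pi_{ij} = \frac{n_{ij}}{n}$ only, i.e. are $O(1)$ in $n$.
Strictly speaking we should expand $\frac{1}{n+1} = \frac{1}{n} + O(n^{-2})$, i.e. drop the +1, but the exact
expression (9) for the covariance suggests keeping the +1. We compared both versions
with the exact values (from Monte Carlo simulations) for various parameters $\pi$. In most
cases the expansion in $\frac{1}{n+1}$ was more accurate, so we suggest using this variant.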
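
The leading-order Dirichlet results need only one pass over the table. The following sketch assumes the forms $E[I] \approx J + \frac{(r-1)(s-1)}{2(n+1)}$ and $\mathrm{Var}[I] \approx \frac{K - J^2}{n+1}$, which are what inserting the Dirichlet covariance (9) into the covariance expressions of the previous section yields. The function name `dirichlet_mi_moments` is an assumption for this example; `counts` holds $n_{ij}$ (samples plus prior pseudo-counts), all strictly positive.

```python
import math


def dirichlet_mi_moments(counts):
    """Leading-order posterior mean and variance of I under a Dirichlet posterior.

    counts : r x s nested list of n_ij = n'_ij + n''_ij, all > 0.
    """
    r, s = len(counts), len(counts[0])
    n = sum(sum(row) for row in counts)
    row = [sum(rw) for rw in counts]        # n_{i+}
    col = [sum(c) for c in zip(*counts)]    # n_{+j}
    J = 0.0  # sum of pi_hat * log-ratio (the point estimate)
    K = 0.0  # sum of pi_hat * log-ratio squared
    for i in range(r):
        for j in range(s):
            n_ij = counts[i][j]
            lg = math.log(n_ij * n / (row[i] * col[j]))
            J += n_ij / n * lg
            K += n_ij / n * lg * lg
    mean = J + (r - 1) * (s - 1) / (2.0 * (n + 1))
    var = (K - J * J) / (n + 1)
    return mean, var
```

For a table that factorizes exactly, $J = K = 0$, so the leading-order variance vanishes and only the positive mean correction remains; this is the case in which higher-order terms become essential.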
6 Exact Value for E[I]


It is possible to get an exact expression for the mean mutual information $E[I]$ under
the Dirichlet distribution. By noting that $x\log x = \frac{d}{d\beta}x^{\beta}\big|_{\beta=1}$ ($x \in \{\pi_{ij},\pi_{i+},\pi_{+j}\}$), one can
replace the logarithms in the last expression of (1) by powers. From (8) we see that
$E[(\pi_{ij})^{\beta}] = \frac{\Gamma(n_{ij}+\beta)\,\Gamma(n)}{\Gamma(n_{ij})\,\Gamma(n+\beta)}$. Taking the derivative and setting $\beta = 1$ we get

$$E[\pi_{ij}\log\pi_{ij}] = \frac{d}{d\beta}E[(\pi_{ij})^{\beta}]\Big|_{\beta=1} = \frac{n_{ij}}{n}\left[\psi(n_{ij}+1) - \psi(n+1)\right].$$


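The $\psi$ function has the following properties (see [AS74] for details):

$$\psi(z) = \frac{d\log\Gamma(z)}{dz} = \frac{\Gamma'(z)}{\Gamma(z)}, \qquad \psi(z+1) = \log z + \frac{1}{2z} - \frac{1}{12z^2} + O\left(\frac{1}{z^4}\right),$$

$$\psi(n) = -\gamma + \sum_{k=1}^{n-1}\frac{1}{k}, \qquad \psi(n+\tfrac12) = -\gamma - 2\log 2 + 2\sum_{k=1}^{n}\frac{1}{2k-1}. \qquad (14)$$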

The value of the Euler constant $\gamma$ is irrelevant here, since it cancels out. Since the
marginal distributions of $\pi_{i+}$ and $\pi_{+j}$ are also Dirichlet (with parameters $n_{i+}$ and $n_{+j}$) we
get similarly

$$E[\pi_{i+}\log\pi_{i+}] = \frac{n_{i+}}{n}\left[\psi(n_{i+}+1) - \psi(n+1)\right],$$

$$E[\pi_{+j}\log\pi_{+j}] = \frac{n_{+j}}{n}\left[\psi(n_{+j}+1) - \psi(n+1)\right].$$

Inserting this into (1), summing over $i$ and $j$, and rearranging terms we get the exact expression⁴

$$E[I] = \frac{1}{n}\sum_{ij} n_{ij}\left[\psi(n_{ij}+1) - \psi(n_{i+}+1) - \psi(n_{+j}+1) + \psi(n+1)\right]. \qquad (15)$$

For large sample sizes, $\psi(z+1) \approx \log z$ and (15) approaches the frequency estimate $I(\hat\pi)$,
as it should. Inserting the expansion $\psi(z+1) = \log z + \frac{1}{2z} + \dots$ into (15) we also get the
correction term $\frac{(r-1)(s-1)}{2n}$ of (3).
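
The exact mean can be evaluated with any digamma routine. The sketch below is a self-contained illustration that assumes no external library: `digamma` is a hypothetical helper that shifts the argument upward with the recurrence $\psi(x) = \psi(x+1) - 1/x$ and then applies the standard asymptotic series, and `exact_mean_mi` evaluates $\frac1n\sum_{ij} n_{ij}[\psi(n_{ij}+1) - \psi(n_{i+}+1) - \psi(n_{+j}+1) + \psi(n+1)]$. Cells with $n_{ij} = 0$ contribute nothing and are skipped.

```python
import math


def digamma(x):
    """psi(x) for x > 0: upward recurrence, then the asymptotic series.

    Accuracy is roughly 1e-8 or better with the shift threshold used here.
    """
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x   # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # psi(x) ~ log x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += math.log(x) - 0.5 * inv
    result -= inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result


def exact_mean_mi(counts):
    """Exact posterior mean of I for Dirichlet parameters n_ij (nested list)."""
    n = sum(sum(row) for row in counts)
    row = [sum(rw) for rw in counts]
    col = [sum(c) for c in zip(*counts)]
    total = 0.0
    for i, rw in enumerate(counts):
        for j, n_ij in enumerate(rw):
            if n_ij > 0:
                total += n_ij * (digamma(n_ij + 1) - digamma(row[i] + 1)
                                 - digamma(col[j] + 1) + digamma(n + 1))
    return total / n
```

As a hand check with integer arguments, a $2 \times 2$ table with all $n_{ij} = 1$ gives $\psi(2) - 2\psi(3) + \psi(5) = 1 - 3 + \frac{25}{12} = \frac{1}{12}$, since the Euler constant cancels.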


The presented method (with some refinements) may also be used to determine an exact
expression for the variance of $I(\pi)$. All but one term can be expressed in terms of Gamma
functions. The final result after differentiating w.r.t. $\beta_1$ and $\beta_2$ can be represented in
terms of $\psi$ and its derivative $\psi'$. The mixed term $E[(\pi_{i+})^{\beta_1}(\pi_{+j})^{\beta_2}]$ is more complicated
and involves confluent hypergeometric functions, which limits its practical use [WW93].


⁴ This expression has independently been derived in [WW93].
7 Generalizations


A systematic expansion of all moments of $p(I|n)$ to arbitrary order in $n^{-1}$ is possible, but
soon gets quite cumbersome. For the mean we already gave an exact expression (15), so
we concentrate here on the variance, skewness and the kurtosis of $p(I|n)$. The 3rd and 4th
central moments of $\pi$ under the Dirichlet distribution are

$$E[\Delta_a\Delta_b\Delta_c] = \frac{2}{(n+1)(n+2)}\left[2\hat\pi_a\hat\pi_b\hat\pi_c - \hat\pi_a\hat\pi_b\delta_{bc} - \hat\pi_b\hat\pi_c\delta_{ca} - \hat\pi_c\hat\pi_a\delta_{ab} + \hat\pi_a\delta_{ab}\delta_{bc}\right], \qquad (16)$$

$$E[\Delta_a\Delta_b\Delta_c\Delta_d] = \frac{1}{n^2}\big[3\hat\pi_a\hat\pi_b\hat\pi_c\hat\pi_d - \hat\pi_c\hat\pi_d\hat\pi_a\delta_{ab} - \hat\pi_b\hat\pi_d\hat\pi_a\delta_{ac} - \hat\pi_b\hat\pi_c\hat\pi_a\delta_{ad} - \hat\pi_a\hat\pi_d\hat\pi_b\delta_{bc} - \hat\pi_a\hat\pi_c\hat\pi_b\delta_{bd} - \hat\pi_a\hat\pi_b\hat\pi_c\delta_{cd}$$
$$\qquad\qquad + \hat\pi_a\hat\pi_c\delta_{ab}\delta_{cd} + \hat\pi_a\hat\pi_b\delta_{ac}\delta_{bd} + \hat\pi_a\hat\pi_b\delta_{ad}\delta_{bc}\big] + O(n^{-3}), \qquad (17)$$

with $a = ij$, $b = kl$, ... $\in \{1,\dots,r\} \times \{1,\dots,s\}$ being double indices, $\delta_{ab} = \delta_{ik}\delta_{jl}$, ..., and $\hat\pi_{ij} = \frac{n_{ij}}{n}$.
Expanding $\Delta^k = (\pi - \hat\pi)^k$ in $E[\Delta_a\Delta_b\dots]$ leads to expressions containing $E[\pi_a\pi_b\dots]$, which
can be computed by a case analysis of all combinations of equal/unequal indices $a,b,c,\dots$
using (8). Many terms cancel, leading to the above expressions. They allow one to compute
the order $n^{-2}$ term of the variance of $I(\pi)$. Again, inspection of (16) suggests expanding
in $[(n+1)(n+2)]^{-1}$, rather than in $n^{-2}$. The variance in leading and next to leading order
is


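$$\mathrm{Var}[I] = \frac{K - J^2}{n+1} + \dots \qquad (18)$$

(The next to leading order term of (18) and the definitions (19) and (20) of $M$ and $Q$ are missing from this copy of the text.)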


$J$ and $K$ are defined in (12) and (13). Note that the first term $\frac{K - J^2}{n+1}$ also contains second
order terms when expanded in $n^{-1}$. The leading order terms for the 3rd and 4th central
moments of $p(I|n)$ are

$$E[(I - E[I])^3] = \frac{2}{n^2}\left[2J^3 - 3KJ + L\right] + \frac{3}{n^2}\left[K + J^2 - P\right] + O(n^{-3}),$$

$$E[(I - E[I])^4] = \frac{3}{n^2}\left[K - J^2\right]^2 + O(n^{-3}),$$

$$L := \sum_{ij}\frac{n_{ij}}{n}\left(\log\frac{n_{ij}\,n}{n_{i+}n_{+j}}\right)^3, \qquad P := \sum_{i}\frac{n\,J_{i+}^2}{n_{i+}} + \sum_{j}\frac{n\,J_{+j}^2}{n_{+j}}, \qquad J_{ij} := \frac{n_{ij}}{n}\log\frac{n_{ij}\,n}{n_{i+}n_{+j}},$$

with $J_{i+} = \sum_j J_{ij}$ and $J_{+j} = \sum_i J_{ij}$,
from which the skewness and kurtosis can be obtained by dividing by $\mathrm{Var}[I]^{3/2}$ and $\mathrm{Var}[I]^2$
respectively. One can see that the skewness is of order $n^{-1/2}$ and the kurtosis is $3 + O(n^{-1})$.
Significant deviation of the skewness from 0 or the kurtosis from 3 would indicate a
non-Gaussian $I$. They can be used to get an improved approximation for $p(I|n)$ by making,
for instance, an ansatz

$$p(I|n) \propto (1 + bI + cI^2)\cdot p_0(I|\mu,\sigma^2)$$

and fitting the parameters $b$, $c$, $\mu$, and $\sigma^2$ to the mean, variance, skewness, and kurtosis
expressions above. $p_0$ is the Normal or Gamma distribution (or any other distribution with
Gaussian limit). From this, quantiles $p(I > I_*|n) := \int_{I_*}^{\infty} p(I|n)\,dI$, needed in [KJ96, Kle99],
can be computed. A systematic expansion of arbitrarily high moments to arbitrarily high
order in $n^{-1}$ leads, in principle, to arbitrarily accurate estimates.
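
The skewness and kurtosis can be obtained in the same single pass over the table as the variance. The sketch below assumes the leading-order forms $E[(I-E[I])^3] \approx \frac{2}{n^2}[2J^3 - 3KJ + L] + \frac{3}{n^2}[K + J^2 - P]$ and $E[(I-E[I])^4] \approx \frac{3}{n^2}[K - J^2]^2$, together with the leading-order variance $\frac{K - J^2}{n+1}$. The function name `mi_skewness_kurtosis` is hypothetical, and the routine requires strictly positive $n_{ij}$ and $K > J^2$ (it divides by the variance, so it is not applicable to an exactly independent table).

```python
import math


def mi_skewness_kurtosis(counts):
    """Leading-order skewness and kurtosis of p(I|n) for a Dirichlet posterior.

    counts : r x s nested list of n_ij > 0. Requires K > J^2.
    """
    r, s = len(counts), len(counts[0])
    n = sum(sum(row) for row in counts)
    row = [sum(rw) for rw in counts]
    col = [sum(c) for c in zip(*counts)]
    J = K = L = 0.0
    J_row = [0.0] * r   # J_{i+} = sum_j J_ij
    J_col = [0.0] * s   # J_{+j} = sum_i J_ij
    for i in range(r):
        for j in range(s):
            lg = math.log(counts[i][j] * n / (row[i] * col[j]))
            J_ij = counts[i][j] / n * lg
            J += J_ij
            K += J_ij * lg
            L += J_ij * lg * lg
            J_row[i] += J_ij
            J_col[j] += J_ij
    P = (sum(n * J_row[i] ** 2 / row[i] for i in range(r))
         + sum(n * J_col[j] ** 2 / col[j] for j in range(s)))
    var = (K - J * J) / (n + 1)
    m3 = (2.0 * (2.0 * J ** 3 - 3.0 * K * J + L) + 3.0 * (K + J * J - P)) / n ** 2
    m4 = 3.0 * (K - J * J) ** 2 / n ** 2
    return m3 / var ** 1.5, m4 / var ** 2
```

Because $J$, $K$, $L$ and $P$ depend on the frequencies only, multiplying all counts by four should roughly halve the skewness, in line with the $n^{-1/2}$ scaling.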




8 Numerics 


There are short and fast implementations of $\psi$. The code of the Gamma function in
[PFTV92], for instance, can be modified to compute the $\psi$ function. For integer and
half-integer values one may create a lookup table from (14). The needed quantities $J$,
$K$, $L$, $M$, and $Q$ (depending on $n$) involve a double sum, $P$ only a single sum, and the
$r + s$ quantities $J_{i+}$ and $J_{+j}$ also only a single sum. Hence, the computation time for the
(central) moments is of the same order $O(r \cdot s)$ as for the point estimate (1).

“Exact” values have been obtained for representative choices of $\pi_{ij}$, $r$, $s$, and $n$ by Monte Carlo
simulation. The $\pi_{ij} := x_{ij}/x_{++}$ are Dirichlet distributed if each $x_{ij}$ follows a Gamma
distribution. See [PFTV92] for how to sample from a Gamma distribution. The variance has
been expanded in $\frac{rs}{n}$, so the relative errors $\frac{\mathrm{Var}[I]_{\mathrm{approx}} - \mathrm{Var}[I]_{\mathrm{exact}}}{\mathrm{Var}[I]_{\mathrm{exact}}}$ of the approximations (11)
and (18) are of the order of $\frac{rs}{n}$ and $(\frac{rs}{n})^2$ respectively, if $i$ and $j$ are dependent. If they are
independent, the leading term (11) itself drops down to order $n^{-2}$, resulting in a reduced
relative accuracy $O(\frac{rs}{n})$ of (18). Comparison with the Monte Carlo values confirmed an
accuracy in the range $(\frac{rs}{n})^{1 \dots 2}$. The mean (15) is exact. Together with the skewness and
kurtosis we have a good description of the distribution of the mutual information $p(I|n)$
for not too small sample bin sizes $n_{ij}$.

We want to conclude with some notes on useful
accuracy. The hypothetical prior sample sizes $n''_{ij} \in \{0, \frac{1}{rs}, \frac12, 1\}$ can all be argued to be
non-informative [GCSR95]. Since the central moments are expansions in $n^{-1}$, the next
to leading order term can be freely adjusted by adjusting $n''_{ij} \in [0 \dots 1]$. So one may argue
that anything beyond leading order is free to will, and the leading order terms may be
regarded as being as accurate as we can specify our prior knowledge. On the other hand, exact
expressions have the advantage of being safe against cancellations. For instance, leading
order of $E[I]$ and $E[I^2]$ does not suffice to compute the leading order of $\mathrm{Var}[I]$.
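
The Monte Carlo reference values described above can be reproduced along the following lines. This is a minimal sketch rather than the simulation code used for the paper: it assumes Python's standard `random.gammavariate` as the Gamma sampler, normalizes independent Gamma variates with shape $n_{ij}$ to obtain one Dirichlet draw, and returns the sample mean and variance of $I(\pi)$. The names `mi_of_probabilities` and `monte_carlo_mi`, the default sample count and the seed are illustrative choices. All $n_{ij}$ must be strictly positive, so Haldane's prior with empty cells is not covered.

```python
import math
import random


def mi_of_probabilities(p):
    """Mutual information I(pi) of an r x s table of probabilities."""
    row = [sum(rw) for rw in p]
    col = [sum(c) for c in zip(*p)]
    return sum(v * math.log(v / (row[i] * col[j]))
               for i, rw in enumerate(p) for j, v in enumerate(rw) if v > 0)


def monte_carlo_mi(counts, samples=20000, seed=0):
    """Sample mean and variance of I under a Dirichlet posterior with parameters n_ij."""
    rng = random.Random(seed)
    values = []
    for _ in range(samples):
        # x_ij ~ Gamma(n_ij, 1); pi_ij = x_ij / x_++ is then Dirichlet distributed
        x = [[rng.gammavariate(n_ij, 1.0) for n_ij in rw] for rw in counts]
        total = sum(sum(rw) for rw in x)
        values.append(mi_of_probabilities([[v / total for v in rw] for rw in x]))
    mean = sum(values) / samples
    var = sum((v - mean) ** 2 for v in values) / (samples - 1)
    return mean, var
```

For a $2 \times 2$ table with all $n_{ij} = 1$ the sample mean can be compared with the exact value $\frac{1}{12}$ that the digamma expression for $E[I]$ gives in this case.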